FIGURE 4.11
The main framework of the proposed DCP-NAS, where $\alpha$ and $\hat{\alpha}$ denote the real-valued and binary architectures, respectively. We first conduct the real-valued NAS in a single round and generate the corresponding tangent direction. Then we learn a discrepant binary architecture via tangent propagation. In this process, the real-valued and binary networks inherit architectures from their counterparts in turn.
where $\otimes$ is the convolution operation. We omit the batch normalization (BN) and activation layers for simplicity. Based on this, a standard NAS problem is formulated as
$$\max_{w \in \mathcal{W},\, \alpha \in \mathcal{A}} f(w, \alpha), \tag{4.20}$$
where $f : \mathcal{W} \times \mathcal{A} \rightarrow \mathbb{R}$ is a differentiable objective function w.r.t. the network weights $w \in \mathcal{W}$ and the architecture $\alpha$ in the search space $\mathcal{A} \subseteq \mathbb{R}^{M \times E}$, where $E$ and $M$ denote the number of edges and candidate operators, respectively. Since directly optimizing $f(w, \alpha)$ is a black-box problem, we relax the objective function to $\tilde{f}(w, \alpha)$ as the objective of NAS:
$$\min_{w \in \mathcal{W},\, \alpha \in \mathcal{A}} \mathcal{L}_{\mathrm{NAS}} = -\tilde{f}(w, \alpha) = -\sum_{n=1}^{N} p_n(X) \log\big(p_n(w, \alpha)\big), \tag{4.21}$$
where $N$ denotes the number of classes and $X$ is the input data. $\tilde{f}(w, \alpha)$ represents the performance of a specific architecture with real-valued weights, where $p_n(X)$ and $p_n(w, \alpha)$ denote the true distribution and the distribution of the network prediction, respectively.
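To make the relaxation concrete, the sketch below shows one common way such an objective is evaluated in practice: each supernet edge mixes its candidate operations with softmax-normalized architecture weights (a DARTS-style relaxation, assumed here only for illustration since the text does not fix the mixing scheme), and $\mathcal{L}_{\mathrm{NAS}}$ is the cross-entropy between the true label distribution $p_n(X)$ and the supernet prediction $p_n(w, \alpha)$. The class and function names are illustrative, not part of DCP-NAS.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One supernet edge: a softmax(alpha_edge)-weighted sum of M candidate ops."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)

    def forward(self, x, alpha_edge):
        # Relax the discrete operator choice into a convex combination.
        weights = F.softmax(alpha_edge, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

def nas_loss(supernet, alpha, x, labels):
    """L_NAS = -sum_n p_n(X) log p_n(w, alpha): cross-entropy of the supernet
    prediction against the (one-hot) true label distribution."""
    logits = supernet(x, alpha)            # prediction given weights w and architecture alpha
    return F.cross_entropy(logits, labels)
```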
Binary neural architecture search
The 1-bit model aims to quantize $\hat{w}$ and $\hat{a}_{in}$ into $b^{\hat{w}} \in \{-1, +1\}^{C_{out} \times C_{in} \times K \times K}$ and $b^{\hat{a}_{in}} \in \{-1, +1\}^{C_{in} \times H \times W}$ using efficient XNOR and bit-count operations to replace full-precision operations. Following [48], the forward process of the 1-bit CNN is
$$\hat{a}_{out} = \beta \circ b^{\hat{a}_{in}} \odot b^{\hat{w}}, \tag{4.22}$$
where $\odot$ denotes the XNOR and bit-count operations and $\circ$ denotes channel-wise multiplication. $\beta = [\beta_1, \cdots, \beta_{C_{out}}] \in \mathbb{R}_{+}^{C_{out}}$ is the vector of channel-wise scale factors. $b = \mathrm{sign}(\cdot)$ denotes the binarized variable obtained with the sign function, which returns $+1$ if the input is greater than zero and $-1$ otherwise.
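As a minimal illustration of the forward pass in Eq. (4.22), the sketch below binarizes the weights and activations with the sign function and scales the result channel-wise by $\beta$; an ordinary convolution over $\{-1, +1\}$ tensors stands in for the dedicated XNOR and bit-count kernels, to which it is arithmetically equivalent. The module name and initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def binarize(x):
    # sign(.) as defined in the text: +1 if the input is greater than zero, -1 otherwise.
    return torch.where(x > 0, torch.ones_like(x), -torch.ones_like(x))

class BinaryConv2d(nn.Module):
    """Sketch of Eq. (4.22): a_out = beta ∘ (b^{a_in} ⊙ b^{w}).
    Training would additionally require a gradient approximation for sign(.)
    (e.g., a straight-through estimator), which is omitted here."""
    def __init__(self, c_in, c_out, kernel_size, padding=1):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(c_out, c_in, kernel_size, kernel_size))
        self.beta = nn.Parameter(torch.ones(c_out))      # channel-wise scale factors
        self.padding = padding

    def forward(self, a_in):
        b_a = binarize(a_in)          # b^{a_in} in {-1, +1}^{C_in x H x W}
        b_w = binarize(self.weight)   # b^{w} in {-1, +1}^{C_out x C_in x K x K}
        out = F.conv2d(b_a, b_w, padding=self.padding)   # replaces XNOR + bit-count
        return out * self.beta.view(1, -1, 1, 1)         # channel-wise multiplication by beta
```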
The output $\hat{a}_{out}$ then enters several non-linear layers, e.g.,